Robust tests for combining p-values under arbitrary dependency structures

Recently, Liu and Xie proposed a p-value combination test based on the Cauchy distribution (CCT). They showed that when the significance level is small, CCT can control the type I error rate and the resulting p-value can be approximated simply using a Cauchy distribution. One particularly attractive property of CCT is that it is applicable when the p-values to be combined are dependent. However, in this paper, we show that under some conditions the commonly used MinP test is much more powerful than CCT. In addition, in some other situations, CCT has no power at all. Therefore, CCT should be used with caution. We also propose new robust p-value combination tests that use a second MinP or CCT to combine the dependent p-values obtained from CCT and MinP applied to the original p-values. We call the new tests MinP-CCT-MinP (MCM) and CCT-MinP-CCT (CMC). We study the performance of the new tests by comparing them with CCT and MinP in a comprehensive simulation study. Our study shows that the proposed tests, MCM and CMC, are robust and powerful under many conditions, and can be considered as alternatives to CCT or MinP.

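The CCT is constructed using the following test statistic 15 : $T = \sum_{i=1}^{k} w_i T_i$, where $w_i \ge 0$ are the weights satisfying $\sum_{i=1}^{k} w_i = 1$. The p-value from the CCT is calculated as $p_{CCT} = P[C(0, 1) \ge t]$, where $C(0, 1)$ denotes the standard Cauchy distribution and $t = \sum_{i=1}^{k} w_i t_i$ is the observed value of $T$. For the CCT, we have the following new results. Let $P_{(1)}$ and $P_{(k)}$ denote the smallest and largest of the p-values to be combined, and let $T_{(1)}$ and $T_{(k)}$ denote the corresponding (largest and smallest) statistics. Then $T = \sum_{i=1}^{k} w_i T_i \le \sum_{i=1}^{k} w_i T_{(1)} = T_{(1)}$, and $T = \sum_{i=1}^{k} w_i T_i \ge \sum_{i=1}^{k} w_i T_{(k)} = T_{(k)}$. Therefore, $P_{CCT} = P[C(0, 1) \ge T] \ge P[C(0, 1) \ge T_{(1)}] = P_{(1)}$, and $P_{CCT} = P[C(0, 1) \ge T] \le P[C(0, 1) \ge T_{(k)}] = P_{(k)}$.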
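The bounds above can be checked numerically. The following is a minimal sketch, assuming the transformation $T_i = \tan\{(0.5 - p_i)\pi\}$ from Liu and Xie 15 (the definition of $T_i$ is not restated in this section) and equal weights by default; the function name `cct_pvalue` and the example p-values are illustrative only.

```python
import math

def cct_pvalue(pvals, weights=None):
    """Cauchy combination test p-value (sketch).

    Assumes T_i = tan((0.5 - p_i) * pi); this transformation is taken
    from Liu and Xie and is an assumption of this example.
    """
    k = len(pvals)
    if weights is None:
        weights = [1.0 / k] * k  # equal weights, which sum to 1
    # Observed statistic: weighted sum of the transformed p-values
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvals))
    # Upper tail of the standard Cauchy distribution, P[C(0, 1) >= t]
    return 0.5 - math.atan(t) / math.pi

# Illustrative p-values (not taken from the paper)
pvals = [0.001, 0.02, 0.3, 0.8]
p_cct = cct_pvalue(pvals)
# The combined p-value lies between the smallest and largest input p-values
print(min(pvals) <= p_cct <= max(pvals))
```

With a single p-value the function returns that p-value (up to floating-point error), which is the degenerate case of the two bounds.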
Remark 1
Theorem 1 implies that the CCT cannot provide stronger evidence (i.e., a smaller p-value) for rejecting the global null hypothesis than the strongest evidence against any individual null hypothesis.

Remark 2
Because of the fact stated in Remark 1, CCT is not preferable for combining independent p-values.

Remark 3
The same result as in Theorem 2 was also proved in other papers 15,16 , but those proofs made some distributional assumptions about the $T_i$'s. Here we provide a new proof without any additional assumptions (i.e., under truly arbitrary dependency structures of the p-values to be combined).

Remark 4
Theorem 2 proves that CCT can control the type I error rate at small significance levels for arbitrary dependency structures of the p-values to be combined. However, it may have little power, or even no power at all, under some conditions. For instance, if $p_1 + p_2 = 1$ (e.g., $p_1$ and $p_2$ are the p-values from left- and right-sided t-tests), then with equal weights the two terms of the CCT statistic cancel, so the test statistic is 0 and its p-value is $p_{CCT} = P[C(0, 1) \ge 0] = 0.5$. Therefore, for any significance level less than 0.5, the power is 0 (i.e., the type II error rate is 1). Interestingly, under this condition, the MinP test gives the same p-value as the two-sided test. This simple example also indicates that the main result, Theorem 1 of Liu and Xie 15 , may no longer be valid if the assumptions made in their paper are violated. In other words, CCT is not always powerful for combining p-values under arbitrary dependency structures.
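The cancellation described in Remark 4 can be illustrated with a short sketch. It assumes the transformation $T_i = \tan\{(0.5 - p_i)\pi\}$ with equal weights for CCT, and takes MinP to be the Bonferroni-adjusted minimum $\min\{1, k \cdot \min_i p_i\}$; both forms, the function names, and the value 0.004 are assumptions made for illustration.

```python
import math

def cct_pvalue(pvals):
    # Equal-weight CCT; assumes T_i = tan((0.5 - p_i) * pi)
    k = len(pvals)
    t = sum(math.tan((0.5 - p) * math.pi) for p in pvals) / k
    return 0.5 - math.atan(t) / math.pi

def minp_pvalue(pvals):
    # Assumed form of MinP: Bonferroni-adjusted minimum p-value
    return min(1.0, len(pvals) * min(pvals))

# Hypothetical left-sided p-value; the right-sided one is its complement
p_left = 0.004
p_right = 1.0 - p_left
pair = [p_left, p_right]

# The two tangent terms cancel, so CCT returns approximately 0.5
print(cct_pvalue(pair))
# MinP returns 2 * min(p_left, p_right), i.e. the two-sided p-value
print(minp_pvalue(pair))
```

However small `p_left` is, the CCT p-value stays at about 0.5, whereas the MinP p-value shrinks with it.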
From the construction of the test statistic, we see that CCT may gain some power if all the p-values to be combined are small and/or positively correlated. However, as mentioned above, CCT may be much less powerful than the MinP test, which is known for its robustness but is conservative in general. To incorporate the good properties of both CCT and MinP, we propose the following two tests. The first one is called MinP-CCT-MinP (MCM), whose p-value is calculated as $p_{MCM} = 2 \min\{p_{CCT}, p_{MinP}, 0.5\}$, where $p_{CCT}$ and $p_{MinP}$ are obtained by applying CCT and MinP, respectively, to the original p-values to be combined. The second one is called CCT-MinP-CCT (CMC), whose p-value is calculated as $p_{CMC} = \mathrm{CCT}\{p_{CCT}, p_{MinP}\}$.
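The two definitions translate directly into code. The sketch below is illustrative rather than the authors' implementation: it assumes equal weights and $T_i = \tan\{(0.5 - p_i)\pi\}$ for CCT, and a Bonferroni-adjusted minimum for MinP; all function names and the example p-values are hypothetical.

```python
import math

def cct_pvalue(pvals):
    # Equal-weight CCT; assumes T_i = tan((0.5 - p_i) * pi)
    k = len(pvals)
    t = sum(math.tan((0.5 - p) * math.pi) for p in pvals) / k
    return 0.5 - math.atan(t) / math.pi

def minp_pvalue(pvals):
    # Assumed form of MinP: Bonferroni-adjusted minimum p-value
    return min(1.0, len(pvals) * min(pvals))

def mcm_pvalue(pvals):
    # MinP-CCT-MinP: p_MCM = 2 * min{p_CCT, p_MinP, 0.5}
    return 2.0 * min(cct_pvalue(pvals), minp_pvalue(pvals), 0.5)

def cmc_pvalue(pvals):
    # CCT-MinP-CCT: CCT applied to the pair (p_CCT, p_MinP)
    return cct_pvalue([cct_pvalue(pvals), minp_pvalue(pvals)])

# Illustrative p-values (not taken from the paper)
pvals = [0.0005, 0.3, 0.6, 0.9]
print(cct_pvalue(pvals), minp_pvalue(pvals))
print(mcm_pvalue(pvals), cmc_pvalue(pvals))
```

The cap at 0.5 inside `mcm_pvalue` keeps the result in [0, 1]; it is equivalent to truncating $2 \min\{p_{CCT}, p_{MinP}\}$ at 1.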
Since both CCT and MinP can control the type I error rate at small significance levels, and MCM and CMC are, respectively, the MinP and the CCT applied to combine their p-values, we have the following result.

Results
To study the performance of MCM and CMC, we conduct a comprehensive simulation study comparing these tests with CCT and MinP. We also apply the new tests to real data to demonstrate their usefulness.

Simulation study. In the simulation study, following the settings in Liu and Xie 15 , we assume the random vector $X^T = (X_1, X_2, \ldots, X_k)$ has a multivariate normal distribution with correlation matrix $\Sigma = (\sigma_{ij})$. For the correlation matrix, we consider three different models.
Model 1 (AR(1) correlation, "Expo"): $\sigma_{ij} = \rho^{|i-j|}$ for $1 \le i, j \le k$, where $\rho$ is a constant between 0 and 1.
Model 2 (polynomial decay, "Poly"): $\sigma_{ii} = 1$ and $\sigma_{ij} = 1/(0.7 + |i-j|^{r})$ for $1 \le i \ne j \le k$.
Model 3 (singular matrix, "SiG"): Let $A = (a_{ij})$ be a $k/5 \times k$ matrix where $a_{ij} = d^{|i-j|}$ and $d$ is a constant between 0 and 1; the singular correlation matrix $\Sigma$ is then constructed from $A$ as in Liu and Xie 15 .
For the above three models of the correlation matrix $\Sigma$, we use different values for the parameters ($\rho$, $r$, and $d$). We also choose different numbers of p-values (i.e., $k$) in the simulation study.

To investigate how the tests control the type I error rate, we simulate $X \sim MVN(0, \Sigma)$ with $\Sigma$ following one of the three models above. For the power comparison, under the global alternative hypothesis $H_1$, we assume a subset of the components of $X$ have non-zero means. Of those components, we assume some have a negative mean ($-\mu$) and the rest have a positive mean ($\mu$). For each variable $X_i$, three different p-values, corresponding to three types of individual alternatives ($\mu_i < 0$, $\mu_i > 0$, and $\mu_i \ne 0$, respectively), are calculated: $p_i = \Phi(x_i)$, $p_i = 1 - \Phi(x_i)$, and $p_i = 2[1 - \Phi(|x_i|)]$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. All the tests are then applied to the three sets of p-values. Table 1 displays the empirical type I error rate (divided by the significance level) for all of the tests applied to the left-sided p-values using different significance levels.
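The data-generating step of the simulation can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: it builds the Model 1 and Model 2 correlation matrices (Model 2 read as $1/(0.7 + |i-j|^{r})$), samples only from Model 1 using the equivalent AR(1) recursion, and computes the usual left-sided, right-sided, and two-sided normal p-values for unit-variance $X_i$. All function names, the seed, and the parameter values are hypothetical.

```python
import math
import random

def norm_cdf(x):
    # Standard normal CDF via the complementary error function
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def ar1_corr(k, rho):
    # Model 1 ("Expo"): sigma_ij = rho ** |i - j|
    return [[rho ** abs(i - j) for j in range(k)] for i in range(k)]

def poly_corr(k, r):
    # Model 2 ("Poly"), assumed reading: sigma_ij = 1 / (0.7 + |i - j| ** r)
    return [[1.0 if i == j else 1.0 / (0.7 + abs(i - j) ** r)
             for j in range(k)] for i in range(k)]

def sample_ar1(k, rho, means, rng):
    # Draw X ~ MVN(means, Sigma) with Sigma = ar1_corr(k, rho), using
    # X_1 = Z_1 and X_i = rho * X_{i-1} + sqrt(1 - rho^2) * Z_i
    scale = math.sqrt(1.0 - rho * rho)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(1, k):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return [xi + m for xi, m in zip(x, means)]

def three_pvalues(x):
    # p-values for the alternatives mu < 0, mu > 0 and mu != 0
    left = norm_cdf(x)
    right = norm_cdf(-x)
    two_sided = 2.0 * norm_cdf(-abs(x))
    return left, right, two_sided

rng = random.Random(2024)      # arbitrary seed for reproducibility
k, rho = 10, 0.4               # hypothetical settings
null_means = [0.0] * k         # global null: all means are zero
x = sample_ar1(k, rho, null_means, rng)
pvals = [three_pvalues(xi) for xi in x]
print(pvals[0])
```

Sampling from Models 2 and 3 would additionally require a matrix factorization of $\Sigma$ (and, for Model 3, handling of the singular case), which is omitted here.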
All of the tests control the type I error rate, except for CCT, which may have slightly inflated type I error rates when the preset significance levels are large; this pattern was also observed by Liu and Xie 15 . It is also noticeable that under some conditions, MinP, MCM, and CMC may have lower type I error rates than expected. Similar patterns are observed when these tests are applied to the right-sided and two-sided p-values (data not shown), and under other simulation settings (see Tables S1-S2 in supplementary materials).

Real data application. We also applied the proposed tests to real data. Table 2 lists the estimated odds ratios (ORs) and the 95% confidence intervals (CIs) from a meta-analysis of 12 independent trials that examine the effect of patient rehabilitation designed for geriatric patients on functional outcome improvement, compared with usual care. An OR greater than 1 means the new treatment was better than the usual care. The data were taken from Figure 4 of Riley et al. 20 , which is part of Figure 2 of Bachmann et al. 21 . The original meta-analysis was based on a random effect model because Cochran's test for homogeneity indicated that the fixed effect model is not appropriate. However, a goodness-of-fit test showed that the random effect model does not fit the data either, and the p-value combination method was suggested 22 . Based on the given estimated ORs and CIs, we can calculate the individual p-values from the 12 studies 22 . Denoting by $U$ and $L$ the upper and lower limits of the 95% CI, the test statistic can be approximated as $t = \frac{\ln(U \times L)/2}{\ln(U/L)/3.92}$, that is, the midpoint of the CI on the log scale divided by its standard error; its asymptotic null distribution is $N(0, 1)$. The sample sizes of these 12 trials were relatively large, ranging from 108 to 1388; therefore, we can reasonably estimate their p-values using the asymptotic null distribution.
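The conversion from a reported confidence interval to p-values can be sketched as below. The sketch assumes the standard Wald construction: the log OR estimate is the midpoint of the log-scale CI, $\ln(U \times L)/2$, and its standard error is the log-scale CI width divided by $2 \times 1.96 = 3.92$. The function names and the interval (1.10, 2.50) are hypothetical and are not taken from Table 2.

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the complementary error function
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def z_from_ci(lower, upper):
    # Estimated log OR: midpoint of the CI on the log scale
    log_or = 0.5 * math.log(upper * lower)
    # Standard error: log-scale CI width divided by 2 * 1.96
    se = math.log(upper / lower) / 3.92
    return log_or / se

def pvalues_from_ci(lower, upper):
    z = z_from_ci(lower, upper)
    left = norm_cdf(z)                    # alternative OR < 1
    right = norm_cdf(-z)                  # alternative OR > 1
    two_sided = 2.0 * norm_cdf(-abs(z))   # alternative OR != 1
    return left, right, two_sided

# Hypothetical 95% CI for an odds ratio (not a row of Table 2)
left, right, two_sided = pvalues_from_ci(1.10, 2.50)
print(left, right, two_sided)
```

As a sanity check on the construction, a 95% CI whose lower limit is exactly 1 gives a statistic of 1.96 and a two-sided p-value of about 0.05.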
For each study, three types of p-values (one-sided left, one-sided right, and two-sided, under the three alternative hypotheses $OR_i < 1$, $OR_i > 1$, and $OR_i \ne 1$, respectively) are calculated as shown in Table 2 and are used in the p-value combination tests. Table 3 displays the results of the combination tests applied to these three types of p-values. Each of three tests (MinP, the Fisher chi-square test, and the z-test) is used to combine the independent p-values separately for all left-sided, all right-sided, and all two-sided p-values (columns 2-4 of Table 3). The resulting two dependent p-values, from combining the left-sided p-values and from combining the right-sided p-values, are further combined using CCT, MinP, MCM, and CMC. Their p-values are listed in the last four columns of Table 3. For instance, the p-values from using the z-test to combine the independent p-values obtained from the individual left- and right-sided tests are 0.99984 and 0.00016, respectively. Combining these two dependent p-values using CCT, MinP, MCM, and CMC gives 0.50, 0.00031, 0.00063, and 0.00063, respectively. Interestingly, when we combine the independent two-sided p-values, the p-values are 0.013, 0.0068, and 0.075 from the MinP, Fisher chi-square, and z-test, respectively. All of them are greater than the p-values obtained by the MinP, MCM, and CMC tests combining the two dependent p-values, while the CCT test has a large p-value of 0.5. This result indicates that appropriately combining two dependent p-values, each obtained by combining independent p-values from the same direction, is preferable to combining independent two-sided p-values.

Discussion and conclusion
We have shown that when the significance level is small, the recently proposed p-value combination test CCT can control the type I error rate for p-values under arbitrary dependency structures. However, we also showed that under some conditions CCT may be less powerful than MinP or may have no power at all. This could happen, for instance, in a genetic study in which a genetic risk factor is protective for some subpopulations, which results in both some small p-values and some large p-values to be combined. On the other hand, the commonly used MinP test can also control the type I error rate under all conditions and may be more or less powerful than CCT, depending on the conditions. To improve the detection power, we proposed two new tests, MCM and CMC. Through a comprehensive simulation study and a real data application, we showed that MCM and CMC can control the type I error rate and are more robust than CCT and MinP. The proposed tests, MCM and CMC, take advantage of the two methods, CCT and MinP, and therefore maintain reasonable power across the conditions we examined. They can be applied when the dependency structures of the p-values to be combined are unknown. As Theorem 1 shows, CCT (and also MinP, MCM, and CMC) cannot obtain a p-value smaller than the smallest of the p-values to be combined. This result suggests that when we combine independent p-values, we should consider other, more powerful tests, such as the Fisher chi-square test, the z-test, and others 23 . Approaches for combining p-values have been extensively used in statistical practice and have significant effects on data

Data availability
All data are presented in the paper and no additional data are available.

Table 3. Results from the tests applied to a real data application.